{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "bf93dc2a",
   "metadata": {},
   "source": [
    "# Practice of Key Information Extraction in Contract Scenarios Based on ERNIE-4.5-0.3B and PaddleOCR\n",
    "\n",
    "## 1. Background Overview\n",
    "\n",
    "Key Information Extraction (KIE) is an important AI task in the industry, aiming to automatically extract key, structured information from unstructured data such as text and images. This technology is crucial for quickly identifying and extracting important information from large volumes of data, especially when processing complex documents.\n",
    "\n",
    "To address these challenges, PaddleOCR has launched PP-ChatOCRv4, a high-precision key information extraction solution that combines large language models (LLM), multimodal large models (MLLM), and OCR technology. PP-ChatOCRv4 provides a one-stop solution capable of handling complex document information extraction tasks such as layout analysis, rare character recognition, multi-page PDFs, tables, and seal/stamp recognition. Its core advantage lies in integrating the ERNIE large model, fusing abundant data and knowledge to improve the accuracy and applicability of information extraction.\n",
    "\n",
    "However, PP-ChatOCRv4 still faces some challenges in practical applications. Due to the huge number of parameters in its underlying large language model, deployment costs are high, which limits its widespread adoption in resource-limited environments. To address this issue, this tutorial proposes a lightweight language model-based solution using the ERNIE-4.5-0.3B model.\n",
    "\n",
    "ERNIE-4.5-0.3B is a lightweight language model in the ERNIE 4.5 series, with only 0.3B parameters, meeting the information extraction needs in resource-constrained scenarios. This tutorial uses this model as an example to introduce the fine-tuning process of lightweight language models, aiming to reduce resource consumption when using large language models in PP-ChatOCRv4, and to enhance the accuracy and practicality of small language models for information extraction in specific domains.\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/algorithm_ppchatocrv4.png\" width=\"800\"/>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70963396",
   "metadata": {},
   "source": [
    "## 2. Environment Preparation\n",
    "\n",
    "### 2.1 Install the PaddlePaddle Framework\n",
    "\n",
    "In this example, multiple PaddlePaddle deep learning models will be used to perform layout analysis on contract documents. Therefore, you need to install the PaddlePaddle framework first. Please refer to the [installation guide](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html) to complete the installation. An example command is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4725f95",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89d7e63f",
   "metadata": {},
   "source": [
    "### 2.2 Install PaddleOCR\n",
    "\n",
    "To generate question-answer pairs based on the layout information of contracts, we need to use the PP-StructureV3 layout analysis tool. PP-StructureV3 is a document parsing solution launched by PaddleOCR, capable of analyzing complex document data.\n",
    "\n",
    "The PP-StructureV3 document image analysis tool used in this example is already integrated into PaddleOCR, so you need to install PaddleOCR.\n",
    "\n",
    "PaddleOCR provides precompiled Python packages that can be installed with a single command. The installation command is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64ffc43a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install paddleocr==3.0.2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccbd8d99",
   "metadata": {},
   "source": [
    "### 2.3 Install Dependencies Required for ERNIE-4.5-0.3B\n",
    "\n",
    "ERNIE-4.5-0.3B is a lightweight language model in the ERNIE 4.5 series. This model depends on the ERNIE codebase. The installation method is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "019ba931",
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone https://github.com/PaddlePaddle/ERNIE\n",
    "!python -m pip install -r requirements.txt\n",
    "!python -m pip install -e . # We recommend install in editable mode\n",
    "! pip install --upgrade opencv-python opencv-python-headless"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5270555c",
   "metadata": {},
   "source": [
    "## 3. Deployment and Evaluation of ERNIE-4.5-0.3B\n",
    "\n",
    "Here, we take contract data as an example to deploy and evaluate the ERNIE-4.5-0.3B model.\n",
    "\n",
    "### 3.1 Deploy ERNIE 4.5 and Set Key Parameters\n",
    "\n",
    "In this example, the ERNIE large model is called via service requests, so it needs to be deployed as a local service. The deployment of the ERNIE large model can be completed using the FastDeploy tool. FastDeploy is an open-source inference deployment tool from PaddlePaddle designed for large models. For deployment methods, please refer to the [FastDeploy official documentation](https://github.com/PaddlePaddle/FastDeploy).\n",
    "\n",
    "After deploying FastDeploy as a backend service, you need to enter the service URL in the configuration below, and use the script to test the service. If the output contains \"Test successful!\", it means the service is available. Otherwise, the service is unavailable. Please troubleshoot according to the error message."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6db426f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Please fill in the URL of the local service below, e.g., http://0.0.0.0:8000/v1\n",
    "ERNIE_URL = \"\"\n",
    "\n",
    "try:\n",
    "    import openai\n",
    "\n",
    "    client = openai.OpenAI(base_url=ERNIE_URL, api_key=\"api_key\")\n",
    "    question = \"Who are you?\"\n",
    "    response1 = client.chat.completions.create(\n",
    "        model=\"xxx\", messages=[{\"role\": \"user\", \"content\": question}]\n",
    "    )\n",
    "    reply = response1.choices[0].message.content\n",
    "except Exception as e:\n",
    "    print(f\"Test failed! The error message is:\\n{e}\")\n",
    "\n",
    "print(f\"Test succeeded!\\nThe question is: {question}\\nThe answer is: {reply}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d44afa2e",
   "metadata": {},
   "source": [
    "### 3.2 Evaluation of ERNIE-4.5-0.3B's Information Extraction Capability on the Contract Dataset\n",
    "\n",
    "Information extraction effectiveness evaluation refers to the systematic measurement and analysis of the accuracy, completeness, and practicality of structured information automatically extracted from text. The core of the evaluation is to determine to what extent the extraction system can accurately identify and capture key information in the text while avoiding errors and omissions. This process is crucial to ensuring the reliability of the information extraction system, as it directly affects the quality of subsequent applications and the accuracy of decision-making. Through rigorous evaluation, shortcomings in the system can be identified, guiding improvements to algorithms and models to enhance their performance in real-world scenarios. In addition, information extraction evaluation provides an objective standard to compare the strengths and weaknesses of different systems or methods, helping researchers and practitioners choose the solution best suited to their needs. Therefore, in the field of information extraction, effectiveness evaluation is not only a necessary step to verify system performance, but also a key link in driving technological progress and the success of applications.\n",
    "\n",
    "Below, we provide a detailed introduction to the evaluation process of ERNIE-4.5-0.3B's information extraction performance.\n",
    "\n",
    "First, execute the following command to download the datasets and code packages required for training and evaluating ERNIE-4.5-0.3B."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c92a812d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://paddle-model-ecology.bj.bcebos.com/paddlex/PaddleX3.0/DemosWithERNIE/EB_03B_contract.tar\n",
    "!tar -xf EB_03B_contract.tar"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df9e6585",
   "metadata": {},
   "source": [
    "Next, execute the following command to invoke PP-ChatOCRv4 and the deployed ERNIE-4.5-0.3B large language model for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f85df145",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python EB_03B_contract/tools/contract_predict.py \\\n",
    "    --gt_file_path \"EB_03B_contract/contract_val.json\" \\\n",
    "    --output_path \"EB_03B_contract/contract_image/eb_03B_baseline_pred.txt\" \\\n",
    "    --img_dir \"EB_03B_contract/contract_image/contract_val\" \\\n",
    "    --num_gpus 1 \\\n",
    "    --processes_per_gpu 1 \\\n",
    "    --ernie_model_name \"xxx\" \\\n",
    "    --base_url \"http://0.0.0.0:8178/v1\" \\\n",
    "    --api_key \"sk-xxxxxx...\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1630fd40",
   "metadata": {},
   "source": [
    "The detailed parameter descriptions of the above command are as follows:\n",
    "* `gt_file_path`: The file path of the converted labeled evaluation set;\n",
    "* `output_path`: The save path for the prediction results of PP-ChatOCRv4;\n",
    "* `num_gpus`: The number of GPUs used to generate question-answer pairs;\n",
    "* `processes_per_gpu`: The number of processes launched per GPU;\n",
    "* `ernie_model_name`: The model name for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `base_url`: The URL address for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `api_key`: The API key for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "\n",
    "Executing the above command will generate a txt file of the prediction results for key information extraction, as detailed below:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/2_2_baseline_pred.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "Finally, after obtaining the inference results, execute the following command to transform and evaluate the prediction results of the trained model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "752548c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python EB_03B_contract/tools/convert2ppchat_result.py \\\n",
    "    --predict_ori_path \"EB_03B_contract/contract_image/eb_03B_baseline_pred.txt\" \\\n",
    "    --predict_new_path \"EB_03B_contract/contract_image/predict_res_baseline_fix.json\"\n",
    "!python EB_03B_contract/tools/contract_dataset_eval.py \\\n",
    "    --gt_file_path \"EB_03B_contract/contract_val.json\" \\\n",
    "    --predict_file_path \"EB_03B_contract/contract_image/predict_res_baseline_fix.json\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e02ee4e6",
   "metadata": {},
   "source": [
    "The detailed parameter descriptions of the above command are as follows:\n",
    "* `predict_ori_path`: The input path for the prediction results of PP-ChatOCRv4;\n",
    "* `predict_new_path`: The output path for the prediction results converted to comply with the PP-ChatOCRv4 specification;\n",
    "* `gt_file_path`: The path for the evaluation set annotation file converted to comply with the PP-ChatOCRv4 specification;\n",
    "* `predict_file_path`: The path for the evaluation set prediction file converted to comply with the PP-ChatOCRv4 specification;\n",
    "\n",
    "After executing the above command, the model's evaluation results will be printed, as follows:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/2_2_baseline_eval_rst.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "The scores after evaluation are as follows:\n",
    "\n",
    "| Comparison Method          | Recall Score |\n",
    "|---------------------------|--------------|\n",
    "| ERNIE-4.5-0.3B            | 0.07         |\n",
    "\n",
    "## 4. Building the Fine-tuning Dataset for ERNIE-4.5-0.3B\n",
    "\n",
    "When constructing the fine-tuning dataset for the ERNIE-4.5-0.3B model, the quality and adaptability of the data are of utmost importance, particularly for domain-specific tasks such as contract data processing. High-quality data not only enhances the model's performance but also ensures its practicality and accuracy in specific application scenarios. Traditionally, constructing high-quality question-answer pairs (QA pairs) relies on manual annotation, a process that is not only time-consuming but may also lead to data inconsistency or bias due to human subjectivity. Additionally, the cost of manual annotation is high, especially when large-scale datasets are required. To overcome these challenges, we have introduced a semi-automated process for constructing QA pairs. Specifically, we utilize a pre-trained ERNIE large model to automatically generate initial QA pairs. This step reduces the need for human intervention while ensuring the linguistic and semantic consistency of the generated QA pairs through the model's understanding capabilities. Subsequently, we can conduct manual review and correction of the automatically generated results to further improve the quality of the dataset. Through this approach, we are able to efficiently construct a fine-tuning dataset suitable for contract data, thereby providing a solid foundation for the ERNIE-4.5-0.3B model to excel in handling contract-related tasks.\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/3_01_framework.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "### 4.1 Generating QA Pairs Based on Large Language Models\n",
    "Below, we use contract data as an example to introduce the construction process of the ERNIE-4.5-0.3B fine-tuning dataset. First, download the sample contract data. Then, execute the following command to generate QA pairs using the ERNIE large language model. Here, we provide two methods for invoking large language models to construct QA pairs: constructing QA pairs based on a locally deployed large model and constructing QA pairs by invoking a large model on the Baidu Intelligent Cloud Qianfan platform. Among them, the ERNIE large language model is deployed using the FastDeploy Serving approach for large models.\n",
    "It should be noted that this tutorial is primarily developed for the paddleocr3.0.2 version and may not be applicable to other versions of paddleocr.\n",
    "\n",
    "(1) The script for constructing QA pairs through a locally deployed large model is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f26c96c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate QA pairs for the training set\n",
    "!python EB_03B_contract/tools/gen_qa_pairs.py \\\n",
    "     --input_dir \"EB_03B_contract/contract_image/contract_train\" \\\n",
    "     --output_dir \"EB_03B_contract/contract_image/contract_qa_pairs_train\" \\\n",
    "     --num_gpus 1 \\\n",
    "     --processes_per_gpu 1 \\\n",
    "     --ernie_model_name \"xxx\" \\\n",
    "     --base_url \"http://0.0.0.0:8178/v1\" \\\n",
    "     --api_key \"sk-xxxxxx...\" \\\n",
    "     --qa_pair_per_image 5 \\\n",
    "     --is_test True\n",
    "# Generate QA pairs for the test set\n",
    "!python EB_03B_contract/tools/gen_qa_pairs.py \\\n",
    "     --input_dir \"EB_03B_contract/contract_image/contract_val\" \\\n",
    "     --output_dir \"EB_03B_contract/contract_image/contract_qa_pairs_val\" \\\n",
    "     --num_gpus 1 \\\n",
    "     --processes_per_gpu 1 \\\n",
    "     --ernie_model_name \"xxx\" \\\n",
    "     --base_url \"http://0.0.0.0:8178/v1\" \\\n",
    "     --api_key \"sk-xxxxxx...\" \\\n",
    "     --qa_pair_per_image 5 \\\n",
    "     --is_test True"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d45739d",
   "metadata": {},
   "source": [
    "The detailed parameter descriptions of the above command are as follows:\n",
    "* `input_dir`: The directory of input contract images;\n",
    "* `output_dir`: The directory of output JSON files containing question-answer pairs;\n",
    "* `num_gpus`: The number of GPUs used to generate question-answer pairs;\n",
    "* `processes_per_gpu`: The number of processes launched per GPU;\n",
    "* `ernie_model_name`: The model name for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `base_url`: The URL address for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `api_key`: The API key for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `qa_pair_per_image`: The number of question-answer pairs generated per contract image;\n",
    "* `is_test`: Whether to perform a run-through test. When set to True, question-answer pairs will only be generated for the first image. When generating actual QA datasets, this parameter should be set to False.\n",
    "\n",
    "After executing the command, a series of JSON files containing image paths and question-answer pairs will be generated in the EB_03B_contract/contract_image/contract_qa_pairs_train and EB_03B_contract/contract_image/contract_qa_pairs_val folders. The content of the files is specifically as follows:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/3_01_local.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "(2) The script for calling a large model to construct QA pairs based on Baidu Intelligent Cloud Qianfan platform is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09b0db18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Please fill in the Qianfan API KEY below\n",
    "API_KEY = \"Qianfan API KEY\"\n",
    "# Generate QA pairs for the training set\n",
    "!python EB_03B_contract/tools/gen_qa_pairs.py \\\n",
    "     --input_dir \"EB_03B_contract/contract_image/contract_train\" \\\n",
    "     --output_dir \"EB_03B_contract/contract_image/contract_qa_pairs_train\" \\\n",
    "     --num_gpus 1 \\\n",
    "     --processes_per_gpu 1 \\\n",
    "     --ernie_model_name \"xxx\" \\\n",
    "     --base_url \"https://qianfan.baidubce.com/v2\" \\\n",
    "     --api_key {API_KEY} \\\n",
    "     --qa_pair_per_image 5 \\\n",
    "     --is_test True\n",
    "# Generate QA pairs for the test set\n",
    "!python EB_03B_contract/tools/gen_qa_pairs.py \\\n",
    "     --input_dir \"EB_03B_contract/contract_image/contract_val\" \\\n",
    "     --output_dir \"EB_03B_contract/contract_image/contract_qa_pairs_val\" \\\n",
    "     --num_gpus 1 \\\n",
    "     --processes_per_gpu 1 \\\n",
    "     --ernie_model_name \"xxx\" \\\n",
    "     --base_url \"https://qianfan.baidubce.com/v2\" \\\n",
    "     --api_key {API_KEY} \\\n",
    "     --qa_pair_per_image 5 \\\n",
    "     --is_test True"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "290bb62f",
   "metadata": {},
   "source": [
    "The detailed parameter descriptions of the above command are as follows:\n",
    "* `input_dir`: The directory of input contract images;\n",
    "* `output_dir`: The directory of output JSON files containing question-answer pairs;\n",
    "* `num_gpus`: The number of GPUs used to generate question-answer pairs;\n",
    "* `processes_per_gpu`: The number of processes launched per GPU;\n",
    "* `ernie_model_name`: The model name for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `base_url`: The URL address for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `api_key`: The API key for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `qa_pair_per_image`: The number of question-answer pairs generated per contract image;\n",
    "* `is_test`: Whether to perform a run-through test. When set to True, question-answer pairs will only be generated for the first image. When generating actual QA datasets, this parameter should be set to False.\n",
    "\n",
    "After executing the command, a series of JSON files containing image paths and question-answer pairs will be generated in the EB_03B_contract/contract_image/contract_qa_pairs_train and EB_03B_contract/contract_image/contract_qa_pairs_val folders. The specific content of the files is as follows:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/3_01_qifan.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "### 4.2 Correcting QA Pairs by Incorporating Layout Information and Prompt Optimization into Questions\n",
    "\n",
    "During the process of generating question-answering (QA) pairs using large models, despite the use of advanced models and algorithms, issues such as hallucinations and information redundancy may still arise in the results. Hallucinations refer to information generated by the model that does not align with the actual content or is entirely fabricated, while information redundancy refers to the inclusion of unnecessary or repetitive information in the generated content, both of which can affect the quality and practicality of the QA pairs.\n",
    "\n",
    "To optimize these generated QA pairs, we introduce layout context information into the questions and refine the constructed QA dataset through prompt optimization. Specifically, layout information provides details about the position, format, and structure of the text within a document. Leveraging this information can assist the model in better understanding and parsing the text content. For instance, clauses in a contract are typically presented in a certain fixed format, and layout information can help identify these structures and ensure that the generated QA pairs remain consistent with the original text. Prompt optimization, on the other hand, can guide and constrain the model's generation process. By incorporating appropriate prompts during the generation process, the model can better focus on the key parts of the text, reducing the likelihood of generating irrelevant or redundant information. Prompts can be keywords related to the contract's subject matter or important terms concerning legal clauses, all of which can help the model generate more accurate QA pairs.\n",
    "\n",
    "Here, we also provide two methods for invoking large language models to correct QA pairs: correction based on a locally deployed large model and correction based on invoking a large model through Baidu Intelligent Cloud's Qianfan platform.\n",
    "\n",
    "(1) The script for dataset correction using a locally deployed large model is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96cf6dbc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify QA pairs in the training set\n",
    "!python EB_03B_contract/tools/qa_fix.py \\\n",
    "     --input_dir \"EB_03B_contract/contract_image/contract_qa_pairs_train\" \\\n",
    "     --output_dir \"EB_03B_contract/contract_image/contract_qa_pairs_train_fix\" \\\n",
    "     --num_gpus 1 \\\n",
    "     --processes_per_gpu 1 \\\n",
    "     --ernie_model_name \"xxx\" \\\n",
    "     --base_url \"http://0.0.0.0:8178/v1\" \\\n",
    "     --api_key \"sk-xxxxxx...\" \\\n",
    "     --is_test True\n",
    "# Verify QA pairs in the training set (Note: This line is a repetition in the original text, likely for emphasis or clarity, and is translated as is.)\n",
    "!python EB_03B_contract/tools/qa_fix.py \\\n",
    "     --input_dir \"EB_03B_contract/contract_image/contract_qa_pairs_val\" \\\n",
    "     --output_dir \"EB_03B_contract/contract_image/contract_qa_pairs_val_fix\" \\\n",
    "     --num_gpus 1 \\\n",
    "     --processes_per_gpu 1 \\\n",
    "     --ernie_model_name \"xxx\" \\\n",
    "     --base_url \"http://0.0.0.0:8178/v1\" \\\n",
    "     --api_key \"sk-xxxxxx...\" \\\n",
    "     --is_test True"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc861e4d",
   "metadata": {},
   "source": [
    "Executing the command will generate a series of corrected json files containing image paths and question-answer pairs in the directories EB_03B_contract/contract_image/contract_qa_pairs_train_fix and EB_03B_contract/contract_image/contract_qa_pairs_val_fix. The specific content of the files is as follows:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/3_02_local.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "The detailed parameter descriptions of the above command are as follows:\n",
    "* `input_dir`: The input directory where the generated QA pair JSON files are located;\n",
    "* `output_dir`: The output directory for the corrected QA pairs;\n",
    "* `num_gpus`: The number of GPUs used to generate QA pairs;\n",
    "* `processes_per_gpu`: The number of processes launched per GPU;\n",
    "* `ernie_model_name`: The model name for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `base_url`: The URL address for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `api_key`: The API key for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `is_test`: Whether to conduct a connectivity test. If set to True, only the first QA pair JSON file will be corrected;\n",
    "\n",
    "(2) The script for dataset correction by invoking a large model based on the Baidu Intelligent Cloud Qianfan platform is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ccb58963",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Please fill in the Qianfan API KEY below\n",
    "API_KEY = \"Qianfan API KEY\"\n",
    "# Validate QA pairs in the training set,\n",
    "!python EB_03B_contract/tools/qa_fix.py \\\n",
    "     --input_dir \"EB_03B_contract/contract_image/contract_qa_pairs_train\" \\\n",
    "     --output_dir \"EB_03B_contract/contract_image/contract_qa_pairs_train_fix\" \\\n",
    "     --num_gpus 1 \\\n",
    "     --processes_per_gpu 1 \\\n",
    "     --ernie_model_name \"xxx\" \\\n",
    "     --base_url \"https://qianfan.baidubce.com/v2\" \\\n",
    "     --api_key {API_KEY} \\\n",
    "     --is_test True\n",
    "# Validate QA pairs in the training set\n",
    "!python EB_03B_contract/tools/qa_fix.py \\\n",
    "     --input_dir \"EB_03B_contract/contract_image/contract_qa_pairs_val\" \\\n",
    "     --output_dir \"EB_03B_contract/contract_image/contract_qa_pairs_val_fix\" \\\n",
    "     --num_gpus 1 \\\n",
    "     --processes_per_gpu 1 \\\n",
    "     --ernie_model_name \"xxx\" \\\n",
    "     --base_url \"https://qianfan.baidubce.com/v2\" \\\n",
    "     --api_key {API_KEY} \\\n",
    "     --is_test True"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21bef21b",
   "metadata": {},
   "source": [
    "The detailed parameter descriptions of the above command are as follows:\n",
    "* `input_dir`: The input directory where the generated QA pair json files are located;\n",
    "* `output_dir`: The output directory for the corrected QA pairs;\n",
    "* `num_gpus`: The number of GPUs used to generate QA pairs;\n",
    "* `processes_per_gpu`: The number of processes launched per GPU;\n",
    "* `ernie_model_name`: The model name for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `base_url`: The URL address for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `api_key`: The API key for local deployment or Baidu Intelligent Cloud API invocation;\n",
    "* `is_test`: Whether to conduct a pass-through test. If set to True, only the first QA pair json file will be corrected;\n",
    "\n",
    "Executing the command will generate a series of corrected json files containing image paths and question-answer pairs in the EB_03B_contract/contract_image/contract_qa_pairs_train_fix and EB_03B_contract/contract_image/contract_qa_pairs_val_fix directories. The content of the files is as follows: <div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/3_02_qifan.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "### 4.3 Dataset Format Conversion\n",
    "Finally, after obtaining the corrected dataset, some QA pairs without contextual information or practical meaning in the images can be deleted through manual verification, so that the dataset can be closer to real-world contract information extraction scenarios. Finally, use the following script to merge the QA pairs after manual verification and convert them into data that complies with the ERNIE-4.5-0.3B training specifications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0dd55c01",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the training set into the format that complies with ERNIE-4.5-0.3B specifications\n",
    "!python EB_03B_contract/tools/merge_jsonl.py \\\n",
    "    --input_dir \"EB_03B_contract/contract_image/contract_qa_pairs_train_fix\" \\\n",
    "    --output_path \"EB_03B_contract/contract_image/contract_merge_train.jsonl\"\n",
    "# Convert the test set into the format that complies with ERNIE-4.5-0.3B specifications\n",
    "!python EB_03B_contract/tools/merge_jsonl.py  \\\n",
    "    --input_dir \"EB_03B_contract/contract_image/contract_qa_pairs_val_fix\" \\\n",
    "    --output_path \"EB_03B_contract/contract_image/contract_merge_val.jsonl\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2cc1d95",
   "metadata": {},
   "source": [
    "The detailed parameter descriptions of the above command are as follows:\n",
    "* `input_dir`: The input directory for the corrected QA pair json files;\n",
    "* `output_dir`: The output path for the merged and format-converted QA pairs;\n",
    "\n",
    "After executing the command, training and evaluation jsonl files that comply with the ERNIE-4.5-0.3B specification, namely EB_03B_contract/contract_image/contract_merge_train.jsonl and EB_03B_contract/contract_image/contract_merge_val.jsonl, will be generated. The specific contents are as follows:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/3_03_mergejson.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "In addition, it is also necessary to generate a json file that complies with the PP-ChatOCR evaluation specification. The specific command is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8076fea3",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python EB_03B_contract/tools/get_contract_eval_json.py \\\n",
    "    --input_dir \"EB_03B_contract/contract_image/contract_qa_pairs_val_fix\" \\\n",
    "    --output_path \"EB_03B_contract/contract_image/kie_contract_gt_fix.json\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5543623d",
   "metadata": {},
   "source": [
    "After executing the command, a json file EB_03B_contract/contract_image/kie_contract_gt_fix.json will be generated, and the specific content is as follows:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/3_03_ppchatocr_gt.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "## 5. Fine-tuning the ERNIE-4.5-0.3B Model\n",
    "\n",
    "Once the training data for ERNIE-4.5-0.3B is ready, you can proceed with fine-tuning the model. It should be noted that ERNIE-4.5-0.3B requires one or more GPUs with A100 or higher computing power. Here, we have already prepared the generated question-answering dataset, which can be downloaded using the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "628572ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://paddle-model-ecology.bj.bcebos.com/paddlex/PaddleX3.0/DemosWithERNIE/contract_eb03_sft_jsonl.tar\n",
    "!tar -xf contract_eb03_sft_jsonl.tar"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa477c67",
   "metadata": {},
   "source": [
    "After that, you can use the following command to fine-tune the ERNIE-4.5-0.3B model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62c77f83",
   "metadata": {},
   "outputs": [],
   "source": [
    "# download model from huggingface\n",
    "!huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle\n",
    "# # Run on a single GPU card\n",
    "# !export CUDA_VISIBLE_DEVICES=0\n",
    "# ! erniekit train EB_03B_contract/eb03_contract.yaml\n",
    "# Run on eight GPU cards\n",
    "!export CUDA_VISIBLE_DEVICES=0,1,2,3\n",
    "! erniekit train EB_03B_contract/eb03_contract.yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "509f4bb2",
   "metadata": {},
   "source": [
    "The eb03_contract.yaml is the configuration file for training. \n",
    "After the training is completed, the obtained model weight file will be saved in the output directory, as detailed below:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/4_eb_train.png\" width=\"400\"/>\n",
    "</div>\n",
    "\n",
    "## 6. Evaluation of Key Information Extraction Effectiveness of Fine-tuned ERNIE-4.5-0.3B\n",
    "\n",
    "After completing the training of ERNIE-4.5-0.3B, the fine-tuned model can also be conveniently deployed via FastDeploy. For details, refer to Section 2.1 on deploying the ERNIE-4.5-0.3B ERNIE large model. After deploying the fine-tuned model, you can execute the following command to invoke PP-ChatOCRv4 and the deployed ERNIE-4.5-0.3B large language model to evaluate the effectiveness of key information extraction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74cbe3ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python EB_03B_contract/tools/contract_predict.py\\\n",
    "    --gt_file_path \"EB_03B_contract/contract_val.json\" \\\n",
    "    --output_path \"EB_03B_contract/contract_image/eb_03B_pred.txt\" \\\n",
    "    --img_dir \"EB_03B_contract/contract_image/contract_val\" \\\n",
    "    --num_gpus 1 \\\n",
    "    --processes_per_gpu 1 \\\n",
    "    --ernie_model_name \"xxx\" \\\n",
    "    --base_url \"http://0.0.0.0:8178/v1\" \\\n",
    "    --api_key \"sk-xxxxxx...\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61c28d49",
   "metadata": {},
   "source": [
    "After executing the above command, you will obtain a txt file containing the prediction results of information extraction, with the specific content as follows:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/5_eb_sft_pred.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "After obtaining the inference results, execute the following command to transform and evaluate the prediction results of the trained model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d23284e",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python EB_03B_contract/tools/convert2ppchat_result.py \\\n",
    "    --predict_ori_path \"EB_03B_contract/contract_image/eb_03B_pred.txt\" \\\n",
    "    --predict_new_path \"EB_03B_contract/contract_image/predict_res_fix.json\"\n",
    "!python EB_03B_contract/tools/contract_dataset_eval.py \\\n",
    "    --gt_file_path \"EB_03B_contract/contract_val.json\" \\\n",
    "    --predict_file_path \"EB_03B_contract/contract_image/predict_res_fix.json\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bed77e82",
   "metadata": {},
   "source": [
    "The detailed parameter descriptions of the above command are as follows:\n",
    "* `predict_ori_path`: The input path for the prediction results of PP-ChatOCRv4;\n",
    "* `predict_new_path`: The output path for the prediction results converted to comply with the PP-ChatOCRv4 specification;\n",
    "* `gt_file_path`: The path for the evaluation set annotation file converted to comply with the PP-ChatOCRv4 specification;\n",
    "* `predict_file_path`: The path for the evaluation set prediction file converted to comply with the PP-ChatOCRv4 specification;\n",
    "\n",
    "Executing the above command will print the specific evaluation results.\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/5_eb_sft_eval.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "The scores after evaluation are as follows:\n",
    "\n",
    "| Comparison Method          | Recall Score |\n",
    "|---------------------------|--------------|\n",
    "| ERNIE-4.5-0.3B            | 0.07         |\n",
    "| ERNIE-4.5-0.3B-SFT        | 83.08        |\n",
    "\n",
    "## 7 Real-case Testing\n",
    "\n",
    "Real-case testing refers to the use of real-world data and scenarios in an actual application environment to evaluate the performance and reliability of a system or model. The significance and importance of this testing method are manifested in several aspects. Firstly, real-case testing can uncover issues and challenges that are difficult to detect under idealized conditions, as it takes into account the complexity and diversity of data, as well as various anomalies that may occur. Secondly, real-case testing provides an opportunity to verify whether the system can meet user needs in actual use, thereby ensuring its effectiveness and practicality. Additionally, through real-case testing, the robustness and stability of the system can be better evaluated to ensure that it maintains good performance when faced with different environments and stress conditions. Therefore, real-case testing is an indispensable part of the system development and Serving process, which not only improves product quality but also enhances user trust and satisfaction.\n",
    "\n",
    "The specific code for real-case testing is provided below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a28b3448",
   "metadata": {},
   "outputs": [],
   "source": [
    "from paddleocr import PPChatOCRv4Doc\n",
    "\n",
    "chat_bot_config = {\n",
    "    \"module_name\": \"chat_bot\",\n",
    "    \"model_name\": \"xxx\",\n",
    "    \"base_url\": \"http://10.214.40.13:8170/v1\",\n",
    "    \"api_type\": \"openai\",\n",
    "    \"api_key\": \"sk-xxxxxx...\",  # your api_key\n",
    "}\n",
    "image_path = \"./EB_03B_contract/contract_image/contract_val/64c6cbcb_c877_4428_9d40_2ce124120eb4.pdf_3_012.jpg\"\n",
    "question = \"What is the allowable error ratio for the interior construction area within which the house price can be settled on time?\"\n",
    "pipeline = PPChatOCRv4Doc()\n",
    "\n",
    "visual_predict_res = pipeline.visual_predict(\n",
    "    input=image_path,\n",
    "    use_doc_orientation_classify=False,\n",
    "    use_doc_unwarping=False,\n",
    "    use_common_ocr=True,\n",
    "    use_seal_recognition=True,\n",
    "    use_table_recognition=True,\n",
    ")\n",
    "\n",
    "visual_info_list = []\n",
    "for res in visual_predict_res:\n",
    "    visual_info_list.append(res[\"visual_info\"])\n",
    "    layout_parsing_result = res[\"layout_parsing_result\"]\n",
    "\n",
    "\n",
    "chat_result = pipeline.chat(\n",
    "    key_list=[question],\n",
    "    visual_info=visual_info_list,\n",
    "    vector_info=None,\n",
    "    mllm_predict_info=None,\n",
    "    chat_bot_config=chat_bot_config,\n",
    "    retriever_config=None,\n",
    ")\n",
    "print(chat_result['chat_res'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "547b2661",
   "metadata": {},
   "source": [
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/6_case_show.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "After executing the above command, a dictionary with questions as keys and answers as values will be output, such as {'What is the allowable error ratio of the interior construction area for settling the house price on time?': 'Within 3% (inclusive)'}.\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-ChatOCRv4/codebook/6_model_pred.png\" width=\"800\"/>\n",
    "</div>\n",
    "\n",
    "## 8. Summary\n",
    "\n",
    "This tutorial first introduces the meaning of key information extraction, as well as the existing problems and challenges in this field. Then, it introduces the SOTA solution PP-ChatOCRv4 for key information extraction proposed by PaddleOCR and briefly elaborates on its shortcomings. In addition, it emphasizes the importance of fine-tuning large language models for key information extraction in vertical domain scenarios. Subsequently, taking the lightweight language model ERNIE-4.5-0.3B from the ERNIE 4.5 series as an example, the tutorial provides an in-depth explanation of its fine-tuning process, including steps such as environment preparation, fine-tuning dataset construction, and model training. Finally, the fine-tuned model is evaluated and tested on real-world cases to verify its effectiveness and reliability in practical applications."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "696da798",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "wenxin_03B",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
